NSF PAR Search | NSF Public Access Repository

WfBench: Automated Generation of Scientific Workflow Benchmarks

Coleman, T.; Casanova, H.; Maheswari, K.; Pottier, L.; Wilkinson, S.; Wozniak, J.; Suter, F.; Shankar, M.; Ferreira da Silva, R. (January 2022, Proc. of the 13th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS))

The prevalence of scientific workflows with high computational demands calls for their execution on various distributed computing platforms, including large-scale leadership-class high-performance computing (HPC) clusters. To handle the deployment, monitoring, and optimization of workflow executions, many workflow systems have been developed over the past decade. There is a need for workflow benchmarks that can be used to evaluate the performance of workflow systems on current and future software stacks and hardware platforms. We present a generator of realistic workflow benchmark specifications that can be translated into benchmark code to be executed with current workflow systems. Our approach generates workflow tasks with arbitrary performance characteristics (CPU, memory, and I/O usage) and with realistic task dependency structures based on those seen in production workflows. We present experimental results that show that our approach generates benchmarks that are representative of production workflows, and conduct a case study to demonstrate the use and usefulness of our generated benchmarks to evaluate the performance of workflow systems under different configuration scenarios.
Full Text Available
DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

https://doi.org/10.1109/CCGrid49817.2020.00-76

Nicolae, B.; Li, J.; Wozniak, J. M.; Bosilca, G.; Dorier, M.; Cappello, F. (May 2020, 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID))

In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of the learning models and the popularity of distributed data-parallel training approaches, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers that arise because only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O and to distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
Full Text Available
Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training

https://doi.org/10.1109/MLHPC49564.2019.00006

Li, J.; Nicolae, B.; Wozniak, J. M.; Bosilca, G. (November 2019, 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC))
Full Text Available